Developer Interface¶
Configuration¶
Certain functions of the internetarchive library require your archive.org credentials (e.g. uploading, modifying metadata, searching).
Your credentials and other configurations can be provided via a dictionary when instantiating an ArchiveSession or Item object, or in a config file.
The easiest way to create a config file is with the configure function:
>>> from internetarchive import configure
>>> configure('user@example.com', 'password')
Config files are stored in either $HOME/.ia or $HOME/.config/ia.ini by default. You can also specify your own path:
>>> from internetarchive import configure
>>> configure('user@example.com', 'password', config_file='/home/jake/.config/ia-alternate.ini')
Custom config files can be specified when instantiating an ArchiveSession object:
>>> from internetarchive import get_session
>>> s = get_session(config_file='/home/jake/.config/ia-alternate.ini')
Or an Item object:
>>> from internetarchive import get_item
>>> item = get_item('nasa', config_file='/home/jake/.config/ia-alternate.ini')
IA-S3 Configuration¶
Your IA-S3 keys are required for uploading and modifying metadata. You can retrieve your IA-S3 keys at https://archive.org/account/s3.php.
They can be specified in your config file like so:
[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> c = {'s3': {'access': 'mYaccEsSkEY', 'secret': 'mYs3cREtKEy'}}
>>> s = get_session(config=c)
>>> s.access_key
'mYaccEsSkEY'
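The `[s3]` section of a config file and the `config` dictionary carry the same information. The sketch below uses only Python's standard `configparser` module to show how the INI text above maps onto the nested dictionary form. It illustrates the two formats and is not a description of how the library itself reads config files; the `ini_to_config` helper is hypothetical.

```python
import configparser

INI_TEXT = """
[s3]
access = mYaccEsSkEY
secret = mYs3cREtKEy
"""

def ini_to_config(text):
    """Hypothetical helper: turn INI text into a nested dict of sections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Each section becomes a plain dict of option name -> string value.
    return {section: dict(parser[section]) for section in parser.sections()}

config = ini_to_config(INI_TEXT)
print(config)
```

The resulting dictionary has the same shape as the `c` dictionary passed to `get_session(config=c)` in the example above.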
Logging Configuration¶
You can specify logging levels and the location of your log file like so:
[logging]
level = INFO
file = /tmp/ia.log
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> c = {'logging': {'level': 'INFO', 'file': '/tmp/ia.log'}}
>>> s = get_session(config=c)
By default logging is turned off.
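To make the meaning of `level` and `file` concrete, here is a minimal sketch that applies a dictionary of the same shape using only Python's standard `logging` module. The `configure_logging` helper and the `ia_example` logger name are assumptions made for illustration; the library's own logging setup may differ.

```python
import logging
import os
import tempfile

def configure_logging(config):
    """Hypothetical helper: apply a {'logging': {...}} dict with stdlib logging."""
    settings = config.get('logging', {})
    logger = logging.getLogger('ia_example')  # assumed logger name
    if not settings:
        # Mirror the documented default: nothing is logged unless configured.
        logger.addHandler(logging.NullHandler())
        return logger
    handler = logging.FileHandler(settings['file'])
    handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    logger.addHandler(handler)
    logger.setLevel(getattr(logging, settings['level'].upper()))
    return logger

# A temporary path stands in for /tmp/ia.log so the sketch runs anywhere.
log_path = os.path.join(tempfile.mkdtemp(), 'ia.log')
logger = configure_logging({'logging': {'level': 'INFO', 'file': log_path}})
logger.info('upload started')    # written: INFO meets the INFO threshold
logger.debug('request headers')  # dropped: DEBUG is below INFO
```

With `level` set to `INFO`, only messages at `INFO` or above reach the file.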
Other Configuration¶
By default all requests are HTTPS.
You can change this setting in your config file in the general section:
[general]
secure = False
Or, using the ArchiveSession object:
>>> from internetarchive import get_session
>>> s = get_session(config={'general': {'secure': False}})
In the example above, all requests will be made via HTTP.
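One detail to watch if you read such a file yourself: INI values are strings, and the non-empty string `'False'` is truthy in Python. The sketch below uses the standard `configparser` module to convert the value to a real boolean. The mapping from `secure` to a URL scheme is an assumption for illustration and says nothing about how the library performs the conversion internally.

```python
import configparser

INI_TEXT = """
[general]
secure = False
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Raw access returns the string 'False', which is truthy.
raw = parser.get('general', 'secure')
# getboolean() understands true/false, yes/no, on/off and 1/0.
secure = parser.getboolean('general', 'secure', fallback=True)

# Assumed mapping for illustration: secure -> https, otherwise http.
scheme = 'https' if secure else 'http'
print(repr(raw), secure, scheme)
```

When passing a dictionary to `get_session`, as in the example above, a real `False` can be used directly.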
ArchiveSession Objects¶
The ArchiveSession object is subclassed from requests.Session.
It collects together your credentials and config.
- get_session(config=None, config_file=None, debug=None, http_adapter_kwargs=None)[source]¶
Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.
- Parameters
config (dict) – (optional) A dictionary used to configure your session.
config_file (str) – (optional) A path to a config file used to configure your session.
http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
- Returns
ArchiveSession object.
Usage:
>>> from internetarchive import get_session
>>> config = {'s3': {'access': 'foo', 'secret': 'bar'}}
>>> s = get_session(config)
>>> s.access_key
'foo'
From the session object, you can access all of the functionality of the internetarchive lib:
>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_ids=31643513)[0].server
'ia311234'
Item Objects¶
Item objects represent Internet Archive items.
From the Item object you can create new items, upload files to existing items, read and write metadata, and download or delete files.
- get_item(identifier, config=None, config_file=None, archive_session=None, debug=None, http_adapter_kwargs=None, request_kwargs=None)[source]¶
Get an Item object.
- Parameters
identifier (str) – The globally unique Archive.org item identifier.
config (dict) – (optional) A dictionary used to configure your session.
config_file (str) – (optional) A path to a config file used to configure your session.
archive_session (ArchiveSession) – (optional) An ArchiveSession object can be provided via the archive_session parameter.
http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
request_kwargs (dict) – (optional) Keyword arguments that requests.Request takes.
- Usage:
>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084
Uploading¶
Uploading to an item can be done using Item.upload():
>>> from internetarchive import get_item
>>> item = get_item('my_item')
>>> r = item.upload('/home/user/foo.txt')
Or using the upload function:
>>> from internetarchive import upload
>>> r = upload('my_item', '/home/user/foo.txt')
The item will automatically be created if it does not exist.
Refer to archive.org Identifiers for more information on creating valid archive.org identifiers.
Setting Remote Filenames¶
Remote filenames can be defined using a dictionary:
>>> from io import BytesIO
>>> fh = BytesIO()
>>> fh.write(b'foo bar')
>>> item.upload({'my-remote-filename.txt': fh})
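When several files need remote names, the dictionary can be built programmatically. The sketch below assumes, as the description of the files parameter suggests, that dictionary values may be local file paths as well as file-like objects. The `prefixed_remote_names` helper, the `backup-` prefix, and the file paths are hypothetical.

```python
import os
from io import BytesIO

def prefixed_remote_names(paths, prefix):
    """Hypothetical helper: map '<prefix><basename>' to each local path."""
    return {prefix + os.path.basename(p): p for p in paths}

files = prefixed_remote_names(['/home/user/foo.txt', '/home/user/bar.txt'], 'backup-')
# In-memory content can be added under its own remote filename.
files['notes.txt'] = BytesIO(b'foo bar')

print(sorted(files))
```

Under that assumption, the resulting dictionary could be passed to `item.upload(files)` in the same way as the single-entry dictionary above.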
- upload(identifier, files, metadata=None, headers=None, access_key=None, secret_key=None, queue_derive=None, verbose=None, verify=None, checksum=None, delete=None, retries=None, retries_sleep=None, debug=None, validate_identifier=None, request_kwargs=None, **get_item_kwargs)[source]¶
Upload files to an item. The item will be created if it does not exist.
- Parameters
identifier (str) – The globally unique Archive.org identifier for a given item.
files – The filepaths or file-like objects to upload. This value can be an iterable or a single file-like object or string.
metadata (dict) – (optional) Metadata used to create a new item. If the item already exists, the metadata will not be updated; use modify_metadata instead.
headers (dict) – (optional) Add additional HTTP headers to the request.
access_key (str) – (optional) IA-S3 access_key to use when making the given request.
secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
queue_derive (bool) – (optional) Set to False to prevent an item from being derived after upload.
verbose (bool) – (optional) Display upload progress.
verify (bool) – (optional) Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.
checksum (bool) – (optional) Skip uploading files based on checksum.
delete (bool) – (optional) Delete local file after the upload has been successfully verified.
retries (int) – (optional) Number of times to retry the given request if S3 returns a 503 SlowDown error.
retries_sleep (int) – (optional) Amount of time to sleep between retries.
debug (bool) – (optional) Set to True to print headers to stdout, and exit without sending the upload request.
validate_identifier (bool) – (optional) Set to True to validate the identifier before uploading the file.
**get_item_kwargs – Optional arguments that get_item takes.
- Returns
A list of requests.Response objects.
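The verify and checksum options described above both depend on MD5 checksums. As a point of reference, here is a minimal sketch of computing the MD5 digest of a file-like object with the standard `hashlib` module. The `md5_hexdigest` helper is hypothetical, and how the library itself computes or compares checksums is not covered here.

```python
import hashlib
from io import BytesIO

def md5_hexdigest(fileobj, chunk_size=8192):
    """Return the MD5 hex digest of a binary file-like object, read in chunks."""
    digest = hashlib.md5()
    # iter() with a sentinel keeps reading until read() returns b''.
    for chunk in iter(lambda: fileobj.read(chunk_size), b''):
        digest.update(chunk)
    return digest.hexdigest()

local_md5 = md5_hexdigest(BytesIO(b'foo bar'))
print(local_md5)
```

Reading in chunks keeps memory use flat for large files; the same function works on a file opened with `open(path, 'rb')`.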
Metadata¶
- modify_metadata(identifier, metadata, target=None, append=None, append_list=None, priority=None, access_key=None, secret_key=None, debug=None, request_kwargs=None, **get_item_kwargs)[source]¶
Modify the metadata of an existing item on Archive.org.
- Parameters
identifier (str) – The globally unique Archive.org identifier for a given item.
metadata (dict) – Metadata used to update the item.
target (str) – (optional) The metadata target to update. Defaults to metadata.
append (bool) – (optional) Set to True to append metadata values to current values rather than replacing. Defaults to False.
append_list (bool) – (optional) Append values to an existing multi-value metadata field. No duplicate values will be added.
priority (int) – (optional) Set task priority.
access_key (str) – (optional) IA-S3 access_key to use when making the given request.
secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
debug (bool) – (optional) Set to True to return a requests.Request object instead of sending the request. Defaults to False.
**get_item_kwargs – (optional) Arguments that get_item takes.
- Returns
requests.Response object or requests.Request object if debug is True.
The default target to write to is metadata.
If you would like to write to another target, such as files, you can specify so using the target parameter.
For example, if you had an item whose identifier was my_identifier and you wanted to add a metadata field to a file within the item called foo.txt:
>>> from internetarchive import get_files, modify_metadata
>>> r = modify_metadata('my_identifier', metadata={'title': 'My File'}, target='files/foo.txt')
>>> f = list(get_files('my_identifier', 'foo.txt'))[0]
>>> f.title
'My File'
You can also create new targets if they don’t exist:
>>> r = modify_metadata('my_identifier', metadata={'foo': 'bar'}, target='extra_metadata')
>>> from internetarchive import get_item
>>> item = get_item('my_identifier')
>>> item.item_metadata['extra_metadata']
{'foo': 'bar'}
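The append_list parameter is described above as appending to a multi-value field without adding duplicates. The sketch below is a purely local model of that behaviour on a plain dictionary, intended only to make the described semantics concrete. The `append_to_list_field` helper and the sample metadata are hypothetical, and this is not the library's implementation.

```python
def append_to_list_field(metadata, key, new_values):
    """Hypothetical local model of the documented append_list behaviour."""
    current = metadata.get(key, [])
    if not isinstance(current, list):
        current = [current]  # treat a single value as a one-element list
    merged = list(current)
    for value in new_values:
        if value not in merged:  # no duplicate values are added
            merged.append(value)
    # Return a new dict; the input metadata is left unchanged.
    return {**metadata, key: merged}

before = {'title': 'My Item', 'subject': ['nasa', 'space']}
after = append_to_list_field(before, 'subject', ['space', 'moon'])
print(after['subject'])
```

Here `'space'` is already present and is skipped, while `'moon'` is appended to the end of the list.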
Downloading¶
- download(identifier, files=None, formats=None, glob_pattern=None, dry_run=None, verbose=None, ignore_existing=None, checksum=None, destdir=None, no_directory=None, retries=None, item_index=None, ignore_errors=None, on_the_fly=None, return_responses=None, no_change_timestamp=None, **get_item_kwargs)[source]¶
Download files from an item.
- Parameters
identifier (str) – The globally unique Archive.org identifier for a given item.
files – (optional) Only download files matching the given file names.
formats – (optional) Only download files matching the given formats.
glob_pattern (str) – (optional) Only download files matching the given glob pattern.
dry_run (bool) – (optional) Print URLs to files to stdout rather than downloading them.
verbose (bool) – (optional) Turn on verbose output.
ignore_existing (bool) – (optional) Skip files that already exist locally.
checksum (bool) – (optional) Skip downloading file based on checksum.
destdir (str) – (optional) The directory to download files to.
no_directory (bool) – (optional) Download files to current working directory rather than creating an item directory.
retries (int) – (optional) The number of times to retry on failed requests.
item_index (int) – (optional) The index of the item for displaying progress in bulk downloads.
ignore_errors (bool) – (optional) Don’t fail if a single file fails to download, continue to download other files.
on_the_fly (bool) – (optional) Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
return_responses (bool) – (optional) Rather than downloading files to disk, return a list of response objects.
**get_item_kwargs – Optional arguments that get_item takes.
- Returns
True if all files were downloaded successfully.
Deleting¶
- delete(identifier, files=None, formats=None, glob_pattern=None, cascade_delete=None, access_key=None, secret_key=None, verbose=None, debug=None, **kwargs)[source]¶
Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.
- Parameters
identifier (str) – The globally unique Archive.org identifier for a given item.
files – (optional) Only delete files matching the given filenames.
formats – (optional) Only delete files matching the given formats.
glob_pattern (str) – (optional) Only delete files matching the given glob pattern.
cascade_delete (bool) – (optional) Delete all files associated with the specified file, including upstream derivatives and the original.
access_key (str) – (optional) IA-S3 access_key to use when making the given request.
secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
verbose (bool) – (optional) Print actions to stdout.
debug (bool) – (optional) Set to True to print headers to stdout and exit without sending the delete request.
File Objects¶
- get_files(identifier, files=None, formats=None, glob_pattern=None, on_the_fly=None, **get_item_kwargs)[source]¶
Get File objects from an item.
- Parameters
identifier (str) – The globally unique Archive.org identifier for a given item.
files (iterable) – (optional) Only return files matching the given filenames.
formats (iterable) – (optional) Only return files matching the given formats.
glob_pattern (str) – (optional) Only return files matching the given glob pattern.
on_the_fly (bool) – (optional) Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
**get_item_kwargs – (optional) Arguments that get_item() takes.
- Usage:
>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']
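To preview what a glob pattern such as `'*xml'` selects without making a request, shell-style matching can be tried locally with the standard `fnmatch` module. This sketch assumes the library's glob_pattern follows the same shell-style rules, which this page does not state. The `match_glob` helper and the `.jpg` file name are hypothetical.

```python
from fnmatch import fnmatch

# File names from the usage example above, plus one non-XML name
# added purely for illustration.
names = ['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml', 'example_image.jpg']

def match_glob(names, pattern):
    """Return the names that match a shell-style glob pattern, in order."""
    return [n for n in names if fnmatch(n, pattern)]

print(match_glob(names, '*xml'))
```

Note that `'*xml'` matches any name ending in `xml`, with or without a dot before it; `'*.xml'` is stricter.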
Searching Items¶
- search_items(query, fields=None, sorts=None, params=None, full_text_search=None, dsl_fts=None, archive_session=None, config=None, config_file=None, http_adapter_kwargs=None, request_kwargs=None, max_retries=None)[source]¶
Search for items on Archive.org.
- Parameters
query (str) – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.
fields (list) – (optional) The metadata fields to return in the search results.
params (dict) – (optional) The URL parameters to send with each request sent to the Archive.org Advanced Search API.
full_text_search (bool) – (optional) Beta support for querying the archive.org Full Text Search API [default: False].
dsl_fts (bool) – (optional) Beta support for querying the archive.org Full Text Search API in dsl (i.e. do not prepend "!L " to the full_text_search query) [default: False].
config (dict) – (optional) Configuration options for session.
config_file (str) – (optional) A path to a config file used to configure your session.
http_adapter_kwargs (dict) – (optional) Keyword arguments that requests.adapters.HTTPAdapter takes.
request_kwargs (dict) – (optional) Keyword arguments that requests.Request takes.
max_retries (int) – (optional) The number of times to retry a failed request. This can also be a urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession object, and mount your own adapter after the session object has been initialized. For example:
>>> from internetarchive import get_session
>>> s = get_session()
>>> s.mount_http_adapter()
>>> search_results = s.search_items('nasa')
See ArchiveSession.mount_http_adapter() for more details.
- Returns
A Search object, yielding search results.